ABSTRACT

High-throughput genome technologies have 
produced a wealth of data on the association of 
genes and gene products to biological functions. 
Investigators have discovered value in combining 
their experimental results with published genome- 
wide association studies, quantitative trait locus, 
microarray, RNA-sequencing and mutant pheno- 
typing studies to identify gene-function associ- 
ations across diverse experiments, species, 
conditions, behaviors or biological processes. 
These experimental results are typically derived 
from disparate data repositories, publication sup- 
plements or reconstructions from primary data 
stores. This leaves bench biologists with the 
complex and unscalable task of integrating data by 
identifying and gathering relevant studies, 
reanalyzing primary data, unifying gene identifiers 
and applying ad hoc computational analysis to the 
integrated set. The freely available GeneWeaver 
(http://www.GeneWeaver.org) powered by the 
Ontological Discovery Environment is a curated re- 
pository of genomic experimental results with an 
accompanying tool set for dynamic integration of 
these data sets, enabling users to interactively 
address questions about sets of biological functions 
and their relations to sets of genes. Thus, large 
numbers of independently published genomic 
results can be organized into new conceptual 
frameworks driven by the underlying, inferred bio- 
logical relationships rather than a pre-existing 
semantic framework. An empirical 'ontology' is dis- 
covered from the aggregate of experimental know- 
ledge around user-defined areas of biological 
inquiry. 



INTRODUCTION 

An increasing number of investigators have integrated 
genome-wide experimental data across studies (1-5), but 
have lacked the data resources, algorithms and tools for 
widespread deployment of this approach. The data come
largely from gene expression microarray and now 
RNA-sequencing experiments, quantitative trait locus 
(QTL) mapping and genome-wide association studies 
(GWAS). Additional functional genomic data come 
from mutation and perturbation screen analyses (6-8) 
and from the vast array of individually curated biological 
relationships that have been associated with an array of 
biological ontology terms [e.g. (9-11)]. There are many
powerful applications of this strategy, including meta- 
analysis or convergent analysis across microarrays (4), 
the refinement of QTL positional candidates through 
gene-centered evidence (2,12), cross-species translation of 
gene expression (13), mapping model organism findings 
onto human disease (14), using similarity of functional 
associations to identify the function of poorly 
characterized genes and using the similarity of biological 
associations to identify relations among biological 
processes. 

GENEWEAVER: A DATA REPOSITORY AND 
ANALYSIS SYSTEM FOR FUNCTIONAL GENOMIC 
DATA INTEGRATION 

GeneWeaver, powered by the Ontological Discovery 
Environment (15), is a freely available web-based 
software system that enables biologists to perform integra- 
tive functional genomics on their own collections of ex- 
perimental data in combination with the system's 
incorporated database. The system enables users to inte- 
grate genome-wide experimental studies across multiple 
protocols and species (1-5), using convergent evidence to 
find consensus among the noise in reported associations of 
genes and biological functions and to classify functions 
based on their underlying biological entities. The system
addresses a major hurdle to the widespread implementa- 
tion of integrative functional genomics by integrating a 
curated data repository with accompanying analysis 
tools (16). 

The GeneWeaver data repository currently contains 
over 48 000 gene sets comprising over 80 000 genes
from seven species, derived largely from gene expression
microarrays, RNA-sequencing experiments, QTL
mapping and GWAS, mutation and perturbation screens
(6-8), and curated biological relationships [e.g. (9-11)]. In
addition, GeneWeaver integrates other large community 
data sources, such as the drug related gene database 
of the Neuroscience Information Framework (NIF) (17), 
GeneNetwork (18) and the Comparative Toxicogenomics 
Database (CTD) (19). Inclusion of these diverse data re- 
sources, often with incompatible formats, demonstrates 
how GeneWeaver can promote the identification of 
common biological processes of poorly annotated genes 
through data integration without the overhead of 
extensive reanalysis. 

Unlike ontology over-representation analysis systems 
[e.g. DAVID (20)] that match a single gene set result to 
many curated associations of genes to ontology terms or 
pathways, GeneWeaver infers relations among genes and 
processes by set-set matching across all gene sets in a 
query. This enables users to examine similar and distinct 
results among a large number of related experiments 
(including those compiled into ontology annotations), 
and to compare the role of genes to each of the functions 
assessed in the various experiments. 

At its core, GeneWeaver achieves this data convergence 
through use of discrete gene-function associations:
gene lists are converted into an edge list of
gene-to-biological function relationships, in which gene set
descriptors are loosely defined as a 'biological function.'
The resulting pair-wise relationships between genes and 
functions are queried using multivariate statistical 
methods and novel algorithms based on graph data 
mining including biclique, clique and paraclique analysis, 
clustering of the similarity matrix, Jaccard and 
hypergeometric tests, graph clustering algorithms such as 
dominating set and a novel algorithm for inference of an 
empirically derived ontology of gene functional relations, 
the Phenome Map (15). 
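
As a minimal sketch of the representation described above, assume gene sets are held as a plain Python mapping from a gene set identifier to its member genes. The identifiers `GS_A`, `GS_B` and `GS_C`, their contents and the function names are hypothetical and are not taken from the GeneWeaver database or code base. The conversion to an edge list, and the indexing of that edge list by gene, might look like this:

```python
from collections import defaultdict


def to_edge_list(gene_sets):
    """Flatten {gene_set_id: genes} into sorted (gene, gene_set_id) edges."""
    return sorted(
        (gene, gs_id)
        for gs_id, genes in gene_sets.items()
        for gene in set(genes)
    )


def gene_adjacency(edges):
    """Index the edge list by gene: gene -> set of gene sets containing it."""
    adjacency = defaultdict(set)
    for gene, gs_id in edges:
        adjacency[gene].add(gs_id)
    return dict(adjacency)


# Hypothetical example data, for illustration only.
example_sets = {
    "GS_A": ["Drd2", "Gabra2", "Ddx5"],
    "GS_B": ["Drd2", "Ddx5"],
    "GS_C": ["Ddx5", "Comt"],
}
edges = to_edge_list(example_sets)
by_gene = gene_adjacency(edges)
```

Each edge pairs one gene with one gene set descriptor, so the same structure can hold results from any experiment type once identifiers have been cross-mapped.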



APPLICATIONS AND USE OF GENEWEAVER 

A visitor to GeneWeaver lands on a public page with links 
to data management, advanced search and analysis tools. 
All users have access to numerous tutorials, including 
videos, How To guides and our standards documentation 
(see Supplementary Data, http://geneweaver.org/index.php?action=help). While anonymous users have access
to GeneWeaver's curated data repository, an
account is required to store uploaded data, view data 
shared directly among registered users and their groups 
and to create and share individual projects. The system 
consists of a database containing curated gene set data 
and community metadata, both of which interact with
a variety of discovery and data management tools
(Figure 1). A user of the site typically begins their 
analysis with a search for a term or gene of interest. 
When a user enters a free text term, the default is a 
full-text search of meta-data fields including descriptions, 
publication information and NCBO Annotator (21) 
derived MeSH and Disease Ontology (22) terms. When 
a user searches for a gene symbol or other gene identifier, 
any gene set containing that gene symbol, its homologs or 
any identifier mapped onto that identifier by major model 
organism databases is retrieved. Users can refine
searches through field-restricted
queries and Boolean search.

For example, a search for 'Drd2' retrieves gene sets 
including expression in the lateral septal complex from 
the Allen Mouse Brain Atlas (GS127928), positional candidates
for mechanical sensitivity and for methamphetamine
response in mouse QTL analyses (GS84202),
QTL analysis for alcohol consumption (GS84212, 
GS84216), mutant phenotypic alleles for abnormal 
alcohol consumption from Mouse Genome Informatics 
(GS87659), Gene Ontology annotations for post 
synaptic density (GS97454), paraquat interacting genes 
from CTD (GS121514) and 712 other gene sets. A query 
for ethanol retrieves 540 Gene Sets including genetic 
studies of alcoholism in humans (e.g. GS46979, 
GS46980, GS46982, GS46984, GS46985), mice 
(GS37188) and rats (GS37196, GS37197). Clicking on a 
Gene Set name opens a detail page that enables users to 
view the metadata or contents of a gene set, translate 
the identifiers, execute overrepresentation analyses via 
GAGGLE (23) and find similar gene sets. Users can 
select appropriate gene sets using check boxes and add 
them to a 'Project' using 'Add Selected to Project.' The 
user should then select 'Analyze Gene Sets' to perform 
integrative analyses using tools described below 
(Figure 2). The forms under 'Manage Gene Sets' allow 
users to upload their own gene sets and add them to 
new or existing projects for seamless integrative analysis 
with the data in the repository. 



GENEWEAVER ANALYSIS TOOLS AND 
FUNCTIONS 

User selected gene sets are combined through identifier 
cross-mapping to enable analysis of discrete relationships 
between genes and functions using a wide variety of tools. 
Each of the embedded tools functions independently to 
explore and prioritize genes and gene sets of interest. 
The inputs and outputs are described as genes or gene 
sets (Table 1). The modular strategy allows users to 
arrange tools in numerous distinct workflows. Most of 
these tools operate on the representation of a group 
of gene lists as a bipartite graph consisting of genes as 
one set of vertices and gene sets as a second set of 
vertices. The edges between these vertices are weighted 
by scores or unweighted discrete values (15). To execute 
an analysis from the 'Analyze Gene Sets' page, first select 
a project using the check box next to its name, or expand 
the project to select individual gene sets within it. 

Figure 1. Curation and integrative analysis of secondary data in the Ontological Discovery Environment. The overall system architecture consists of
a centralized database that collects a variety of curated data and metadata and serves a suite of analysis tools. It uses data from community
resources to create clusters of gene homology across supported species, enabling ODE to rapidly translate gene sets.



Then select a tool using the icons to the left. Clicking a 
tool icon immediately executes the default analysis, but 
tool options may be set by first expanding the options 
by clicking on the plus sign next to the icon. Most tools 
have options for the inclusion of homologous genes and 
for setting thresholds and analysis parameters. Several 
tools also allow users to highlight nodes containing 
'emphasis genes,' a user specified list of genes of particular 
interest. 

The gene set graph 

The gene set graph tool is a module that enables users to 
identify genes and gene sets that are highly connected 
among an input group of gene sets. Gene Sets are repre- 
sented as a central column of vertices, with high degree 
(highly connected) gene vertices plotted to the right and 
lower degree gene vertices plotted to the left (Figure 3, 
Supplementary Figure S1). Users control the minimum
degree threshold and can incorporate the 'emphasis 
genes' feature. 
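
The role of the minimum degree threshold can be illustrated with a short sketch. It assumes gene sets are supplied as a mapping from identifier to member genes. The function name `high_degree_genes`, the parameter `min_degree` and the example data are hypothetical and do not describe GeneWeaver's internal implementation.

```python
from collections import Counter


def high_degree_genes(gene_sets, min_degree=2):
    """Return {gene: degree} for genes found in at least min_degree gene sets.

    The degree of a gene is the number of distinct gene sets that contain it.
    """
    degree = Counter(
        gene for genes in gene_sets.values() for gene in set(genes)
    )
    return {gene: d for gene, d in degree.items() if d >= min_degree}


# Hypothetical example data, for illustration only.
example_sets = {
    "GS_A": ["Drd2", "Gabra2", "Ddx5"],
    "GS_B": ["Drd2", "Ddx5"],
    "GS_C": ["Ddx5", "Comt"],
}
connected = high_degree_genes(example_sets, min_degree=2)
```

Raising `min_degree` shrinks the set of gene vertices that would be drawn, which is one way to keep a graph of many gene sets readable.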

The phenome graph 

The phenome graph is a hierarchical network of multiway 
gene set intersections (Figure 4, Supplementary Figure 
S2). The result is a graph of gene set intersections of 
very high order, enabling users to find genes connected 
to all populated subsets of an input set of gene lists. 
In algorithmic terms, these intersections are created from 
the overlap of maximal bicliques in discrete bipartite
structures (15). This tool creates an 'empirical ontology'
of function based on the underlying genes within curated
and user-defined gene sets: similar gene sets, regardless
of semantic annotation, are joined through the similarity
of their contents. Our fast implementation of
bipartite algorithms (24) allows GeneWeaver to create 
very large hierarchies from over 100 typical inputs and 
our integration of homology and identifier translation 
tables enables integration across species and experimental 
systems. To ensure robustness of the result and reduce 
the size of the output graph, users may prune resulting 
hierarchies using bootstrapping procedures and stopping 
rules based on leaf contents (such as the minimum 
number of genes or gene sets in each node) or graph 
depth. 
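
To make the idea of populated multiway intersections concrete, the following sketch enumerates maximal bicliques by brute force. This is not the fast bipartite algorithm cited above (24): it is exponential in the number of gene sets and is intended only for small illustrative inputs. The function name `maximal_bicliques` and the example data are hypothetical.

```python
from itertools import combinations


def maximal_bicliques(gene_sets):
    """Brute-force enumeration of maximal bicliques in the gene/gene-set graph.

    Returns {frozenset(gene_set_ids): frozenset(shared_genes)}. Only
    combinations with a non-empty intersection (populated nodes) are kept.
    """
    sets = {gs: frozenset(genes) for gs, genes in gene_sets.items()}
    ids = sorted(sets)
    found = {}
    for k in range(1, len(ids) + 1):
        for combo in combinations(ids, k):
            shared = frozenset.intersection(*(sets[gs] for gs in combo))
            if not shared:
                continue
            # Close the combination: every gene set containing all shared genes.
            closed = frozenset(gs for gs in ids if shared <= sets[gs])
            found[closed] = shared
    return found


# Hypothetical example data, for illustration only.
example_sets = {
    "GS_A": ["Drd2", "Gabra2", "Ddx5"],
    "GS_B": ["Drd2", "Ddx5"],
    "GS_C": ["Ddx5", "Comt"],
}
nodes = maximal_bicliques(example_sets)
```

Ordering the resulting nodes by inclusion of their gene set collections yields the hierarchy: the node shared by all three example sets sits above the node shared by `GS_A` and `GS_B`, which in turn sits above `GS_A` alone.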

Anchored bicliques of biomolecular associates 

The ABBA tool is an analysis module used to find genes 
that have similar functional associations to the members 
of a query set of genes. When a list of genes is input into 
ABBA, the tool generates a list of all gene sets from within 
the GeneWeaver database that contain a user determined 
number of the input genes. ABBA then returns a ranked 
list of similar genes that are enriched among the same 
Gene Sets as the input genes (see Supplementary Figures 
S3 and S4 for sample use cases). The user may designate a 
connectivity threshold for the number of gene sets that 
contain the predicted genes. This tool, for example, has 
been used to successfully identify genes that may be 



D1070 Nucleic Acids Research, 2012, Vol. 40, Database issue 




Figure 2. The analyze gene sets page. GeneWeaver's analysis functions are accessed from this page. Gene sets must first be collected and stored into 
one or more projects by the user. In this case, a project called 'Alcohol' contains 121 gene sets, nine of which are selected for analysis using the tools 
on the right. Options can be selected from this tool bar prior to executing the tool. 



functionally similar to genes associated with autism within 
the MGI database (25). 
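
The two thresholds described for ABBA can be sketched as follows. This is a simplified counting scheme written for illustration and should not be read as the scoring used by the ABBA tool itself. The function name `abba_like`, the parameters `min_query_hits` and `min_set_count`, and the gene and gene set names are all hypothetical.

```python
from collections import Counter


def abba_like(query_genes, gene_sets, min_query_hits=2, min_set_count=2):
    """Rank non-query genes by how many 'anchored' gene sets contain them.

    A gene set is anchored if it contains at least min_query_hits query genes.
    A candidate gene is reported if it occurs in at least min_set_count
    anchored gene sets. Ties are broken alphabetically.
    """
    query = set(query_genes)
    anchored = [
        set(genes)
        for genes in gene_sets.values()
        if len(query & set(genes)) >= min_query_hits
    ]
    counts = Counter(g for genes in anchored for g in genes - query)
    ranked = [(g, c) for g, c in counts.items() if c >= min_set_count]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical example data, for illustration only.
example_sets = {
    "GS_1": ["GeneA", "GeneB", "GeneX", "GeneY"],
    "GS_2": ["GeneA", "GeneB", "GeneX"],
    "GS_3": ["GeneA", "GeneC", "GeneY"],
    "GS_4": ["GeneB", "GeneC", "GeneX", "GeneZ"],
}
candidates = abba_like(["GeneA", "GeneB", "GeneC"], example_sets)
```

In this toy input every gene set contains two query genes, so all four are anchored; `GeneZ` is dropped because it appears in only one of them.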

Gene set similarity 

Gene Set similarity is estimated for a single set against the 
entire database, executed from an individual gene set page, 
or it can be analyzed using pair-wise similarity tools, 
Jaccard Similarity (Supplementary Figure S5) and 
Hypergeometric test. These two tools operate on a user 
selected set of gene sets and produce both a matrix of 
similarity statistics and a matrix of pair-wise Venn 
diagrams for all Gene Sets in the analysis. Each operates 
on a reference gene set consisting only of those genes 
which can be mapped across identifiers from one gene 
set to the other. 
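
Both pair-wise statistics can be written down compactly. The sketch below assumes the two gene sets have already been restricted to a shared reference (background) of mappable genes, whose size is passed in as `background_size`. The function names and the toy gene labels are hypothetical. The hypergeometric P-value is the upper tail, i.e. the probability of an overlap at least as large as the one observed.

```python
from math import comb


def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hypergeom_overlap_pvalue(a, b, background_size):
    """P(overlap >= observed) when len(b) genes are drawn from the background.

    Assumes both sets are subsets of a background of background_size genes.
    """
    a, b = set(a), set(b)
    k, n_a, n_b = len(a & b), len(a), len(b)
    total = comb(background_size, n_b)
    tail = sum(
        comb(n_a, i) * comb(background_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return tail / total


# Hypothetical example: 4 and 3 genes sharing 2, in a background of 10 genes.
set_a = {"g1", "g2", "g3", "g4"}
set_b = {"g3", "g4", "g5"}
similarity = jaccard(set_a, set_b)
p_value = hypergeom_overlap_pvalue(set_a, set_b, background_size=10)
```

For this toy input the Jaccard similarity is 2/5, and the tail probability is (36 + 4)/120 = 1/3, so an overlap of two genes would not be surprising in so small a background.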

Boolean gene set functions 

Although our bipartite graph algorithms are both 
complete and scalable, the ability to handle large query 
results, construct inputs, and interpret and visualize large
graphical results is still a challenge for human users. A
set of Boolean Gene Set Logic features enables users to
reduce large query results. For example, a user may 
identify a set of overlapping quantitative trait loci, each 
for a related biological phenotype. Because a positional
candidate common to these loci is most likely, users can apply the gene set intersection
feature to reduce the group of QTL candidate gene
lists to a single gene set containing only those genes in a
high order intersection of the QTLs. Likewise, users may
want to distill large sets of genes for comparison of related 
functions. By taking the union of all genes that are repre- 
sented among groups of user selected gene sets, compari- 
sons of smaller numbers of gene sets can be performed. 
For example, to identify the similarities and differences 
among genes associated with broad classes of psychologic- 
al disorders, a user could take the union of all genes 
associated with major classes of conditions: mood dis- 
orders, alcoholism, anxiety disorders and chronic stress 
and then use the Phenome Graph function to examine 
high-order intersections among them (Supplementary 
Figure S6). 
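
The union and high-order intersection operations described above reduce to simple set arithmetic. In the sketch below, the function names `union_of` and `at_least`, the parameter `k` and the example QTL candidate lists are hypothetical. Setting `k` to the number of input gene sets gives the full intersection, and smaller values give genes shared by at least `k` of the inputs.

```python
from collections import Counter


def union_of(gene_sets):
    """All genes present in any of the supplied gene sets."""
    return set().union(*map(set, gene_sets))


def at_least(gene_sets, k):
    """Genes present in at least k of the supplied gene sets."""
    counts = Counter(g for genes in gene_sets for g in set(genes))
    return {g for g, c in counts.items() if c >= k}


# Hypothetical positional candidate lists for three overlapping QTLs.
qtl_candidates = [
    ["Gene1", "Gene2", "Gene3"],
    ["Gene2", "Gene3", "Gene4"],
    ["Gene3", "Gene5"],
]
all_candidates = union_of(qtl_candidates)
shared_by_all = at_least(qtl_candidates, k=len(qtl_candidates))
shared_by_two = at_least(qtl_candidates, k=2)
```

The reduced gene sets produced this way can then be passed to the other tools in place of the larger group of inputs.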

Table 1. Analysis tools and basic functions available in GeneWeaver

All functions are available from the 'Analyze GeneSets' page, except for ABBA and 'Find similar gene sets,' which are accessed from the 'Search for
Genes' menu and 'GeneSet Pages,' respectively. Emphasis genes may be accessed from either 'Analyze GeneSets' or any 'Gene Set Page'.



Gene set similarity clustering 

GeneWeaver employs multivariate clustering of the 
Jaccard gene set similarity matrix to rapidly collapse 
large gene sets and arrange them into hierarchical 
clusters. A number of common clustering algorithms are 
included (Supplementary Figure S7). This tool enables 
users to rapidly identify groups of similar and dissimilar 
inputs that can be used to perform important quality 
control for redundant inputs among a set of user- 
submitted data. 
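
As an illustration of clustering on Jaccard distances, the sketch below uses single-linkage agglomeration with a distance cut-off. This is only one of many possible clustering choices and is not claimed to be among those implemented in GeneWeaver. The function names, the parameter `max_distance` and the example data are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - Jaccard similarity; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)


def single_linkage(gene_sets, max_distance):
    """Merge the closest pair of clusters until none is within max_distance."""
    clusters = [[name] for name in sorted(gene_sets)]

    def dist(c1, c2):
        # Single linkage: distance between the closest members.
        return min(
            jaccard_distance(gene_sets[x], gene_sets[y])
            for x in c1
            for y in c2
        )

    while len(clusters) > 1:
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical example data: two pairs of near-duplicate gene sets.
example_sets = {
    "GS_A": {"g1", "g2", "g3"},
    "GS_B": {"g1", "g2", "g3", "g4"},
    "GS_C": {"g7", "g8"},
    "GS_D": {"g7", "g8", "g9"},
}
groups = single_linkage(example_sets, max_distance=0.5)
```

Gene sets that fall into the same tight cluster are candidates for the redundancy check mentioned above.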



DATA CURATION

It is critical for any bioinformatics resource to contain 
traceable data, ample documentation and a mechanism 
for reproducible results. The GeneWeaver database 
consists of curated data from a variety of sources 
(Supplementary Figures S9-S14, Supplementary Tables 
S4 and S5), and all data shared publicly on the site must 
pass curatorial review. As a final check on the accuracy of
data in the database, users are given sufficient information
to be able to retrieve or regenerate the underlying Gene 
Set and may submit their corrected version to our curator. 

Each gene set in GeneWeaver is stored with descriptive 
metadata at several levels, including publication informa- 
tion, reference gene identifiers, free text descriptions and 
structured term annotations for disease, biological process 
and experimental context. Controlled descriptions of
the gene sets are submitted by users through tree-based term
navigation and selection from the Open Biomedical 
Ontologies [Mouse Anatomy, Gene Ontology, 
Mammalian Phenotype Ontology (26), MeSH]; additional 
terms are suggested using the NCBO Annotator tool. 
GeneWeaver's curator reviews these terms for accuracy. 
These levels provide human and machine interpretable in- 
formation about each functional genomics experiment to 
describe the analysis process by which a result was origin- 
ally generated. Free-text 'Gene Set names' succinctly 
describe the contents of a set. A fuller 'description' field 
provides a detailed description of the data generation and 
rules for inclusion on a gene set list. An 'abbreviation' field
provides a compact recognizable label for graphics.
Finally, publication information is obtained to enable
users to readily retrieve complete information about the
underlying experiment.

Figure 3. The gene set graph. The gene set graph reveals the highly connected genes among the nine gene sets selected in Figure 2. This analysis
reveals DDX5 as the most highly connected gene, connected to both human and mouse alcohol-related measures. Inset: clicking on a gene node
executes a search for gene sets containing the featured gene or its homologs. Clicking on a gene set node reveals the contents and metadata for that
gene set.

GeneWeaver has a systematic curation process to 
supply validated secondary data for analysis tools. 
Internal curation is driven by directed research, largely 
related to alcoholism, addiction and behavioral disorders, 
but the workflow may be applied to multiple interest 
groups. Literature-based curation starts with the identification
of subject terms, which are then combined with
genome experimentation keywords such as 'GWAS,'
'QTL,' 'microarray' and 'RNAseq.' Combined terms are
used in MEDLINE searches to identify publications 
which may contain secondary functional genomics 
results. Each publication is checked against the existing 
database and manually scanned for secondary data, 
including supplemental files. Appropriate data sets 
are uploaded along with descriptive metadata. 
Descriptions are written to contain sufficient information 
to enable users to readily retrieve the original source data 
and to access documentation for the individual studies. 

Figure 4. The phenome graph. The phenome graph drawn from nine inputs selected in Figure 2. The phenome graph is a directed acyclic graph of
the intersections of gene sets. Each node represents gene sets and the genes they share. Higher order intersections are represented in the root nodes at
the top, and individual gene sets in the leaves at the bottom. Inset: clicking a node opens a page showing the intersections among gene sets in list
form. Results from this page can be sent to other tools for annotation, including GAGGLE.



The NCBO Annotator is then applied to suggest MeSH 
terms, Disease Ontology, Gene Ontology, Mammalian 
Phenotype Ontology and other structured vocabulary 
terms that may be appropriate to the data set. These 
terms are reviewed by the curator to comprise the 
final entry. 



Gene sets may also be submitted by users for public 
access via the GeneWeaver database. Each of these is 
marked provisional until the curator reviews them by 
first searching for similar sets from the same publication, 
and searching for similar sets by contents to identify du- 
plicate entries. The curator then determines whether 
the description is compliant with our standards document
(see Supplementary Data), and assesses whether
the gene set conforms to the literature and/or description
provided. Finally, the NCBO Annotator is applied
to identify structured annotations to the gene set.
Ultimately, the user may wish to reanalyze data from 
the original source and evaluate the appropriateness of 
particular source data for their analyses. Sufficient 
details and links to external sites are provided for this 
purpose. 

A large amount of GeneWeaver data comes from major
bioinformatics resources including NCBI, ENSEMBL
and various model organism databases, including MGD
(9), Rat Genome Database [RGD (27)], HUGO Gene
Nomenclature Committee [HGNC (28)], Saccharomyces
Genome Database [SGD (29)], FlyBase (30), WormBase
(31) and the Zebrafish Model Organism Database [ZFIN
(32)]. Some of these data are converted to gene sets,
including GO and MP annotations, Comparative 
Toxicogenomics Database (33) associations and QTL pos- 
itional candidates from RGD and MGI. These data 
sources are updated every 6 months. A large accessory 
data warehouse contains the gene identifier mappings 
obtained from Homologene and the model organism 
databases. A table of update dates is found 
at http://geneweaver.org/index.php?action=help&cmd=updates,
and acquisition dates of individual gene sets are
indicated at the level of the gene set. When a newer version 
of a gene set in a user project exists, the older version is 
flagged as 'deprecated,' and the user has the opportunity 
to update the stored set to the latest version. 

Each time the data warehouse is updated, gene identi- 
fiers are mapped onto one another using the mapping 
provided by each species' model organism database, and 
assigned a unique ID. Users specify the source species of 
their upload, and the uploaded list is compared to the gene 
identifiers in the database for that species to identify the 
appropriate mapping to gene identifiers. Gene symbols can be
ambiguous, and users are cautioned to avoid uploading symbols
if at all possible. When an ambiguous symbol, i.e. one for
which there are multiple gene ids from the Generic Model 
Organism Database (GMOD), is uploaded, the user is 
notified and all of the corresponding gene IDs are 
assigned to the set. This inherent noise in the user input 
is filtered out through convergent data analysis, provided 
that multiple instances of the precise identifier occur, 
whereas spurious associations are unlikely to recur by 
chance. 

A multitiered system denotes the level of curation 
applied to each data set, allowing users to limit the 
scope of their analysis as desired. All data obtained 
from established community resources such as MP and 
GO annotations to genes and Kyoto Encyclopedia of 
Genes and Genomes pathway members are classified as 
Tier I, Public Research Resource data, and are updated on
a 6-month cycle. Pathway data (e.g. from KEGG) are converted
to gene lists representing the gene products in each
pathway. Gene Ontology annotations are converted to 
gene lists through transitive closure such that the genes 
annotated to any child node are also annotated to 
parent nodes successively up each branch of the
ontology. A similar process is used to convert phenotypic
alleles annotated to the Mammalian Phenotype Ontology. 
Each allele is assigned to the gene symbol for which it is 
allelic, and that symbol is associated to gene lists corres- 
ponding to each MPO term. Transitive closure is again 
used to assign gene symbols to parents of any child term 
in the MPO. Data version and accession date are 
integrated into the metadata for these records. 
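
The transitive closure step can be sketched as a walk from each directly annotated term up through its ancestors. The sketch assumes the ontology is available as a mapping from each term to its parent terms and that it is acyclic. The function name `propagate_annotations` and the term labels are hypothetical placeholders rather than real GO or MPO identifiers.

```python
def propagate_annotations(direct, parents):
    """Propagate gene annotations from each term to all of its ancestors.

    direct:  term -> iterable of genes annotated directly to that term
    parents: term -> iterable of parent terms (a directed acyclic graph)
    Returns term -> set of genes annotated directly or via any descendant.
    """
    closed = {term: set(genes) for term, genes in direct.items()}
    for term, genes in direct.items():
        stack = list(parents.get(term, ()))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            closed.setdefault(ancestor, set()).update(genes)
            stack.extend(parents.get(ancestor, ()))
    return closed


# Hypothetical three-level ontology branch, for illustration only.
example_parents = {"child_term": ["mid_term"], "mid_term": ["root_term"]}
example_direct = {"child_term": ["Drd2"], "mid_term": ["Comt"]}
gene_lists = propagate_annotations(example_direct, example_parents)
```

Each resulting term-to-genes entry corresponds to one gene list, so a gene annotated to a specific term also appears in the lists for every broader term above it.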

Tier II consists of human-curated machine-generated 
data, such as phenotype to gene correlation data in refer- 
ence populations and QTL positional candidates or 
GWAS positional candidates. GeneWeaver's curation 
team examines the fidelity of these large data sets to our 
standards documentation and assigns appropriate 
metadata to them. Tier III data consists of curated 
results of individual functional experimental studies, typ- 
ically obtained from publications. Tier IV, or provisional 
data, is submitted data, advanced by users for public 
sharing but still pending review. It is exposed to search 
and analytic modules, but can be filtered or excluded 
readily. Curators examine provisional data for its adher- 
ence to our standards documentation for metadata 
quality, redundancy with existing data or unusual thresh- 
old artifacts, before moving it to Tier III, human curated. 
Tier V consists of data explicitly for private, personal or 
group use and is not shared with the larger community. 

Users and groups 

In order to tightly couple analysis to the collation and 
curation of user-submitted data, ODE allows guests to 
create individual accounts, user-defined groups and 
associated projects. Automated graded-access is built 
into analysis tools to control data sharing across users 
and groups. Any gene set may be maintained as private 
and excluded from our curation Tiers, although it is 
strongly suggested that users make every effort to meet 
requirements outlined in the standards documentation 
(see Supplementary Data) to facilitate interpretation of
analysis results by users and group members. Projects 
provide an effective means to sort and store related data 
sets and query results over time. Version control of data in 
user accounts is also provided. Data curation is a dynamic 
process. It is possible that user selected genes or gene sets 
may change status over time with some data being 
deprecated. Legacy data are not removed from the repository,
but data sets containing deprecated genes or
gene-function associations are clearly noted and may or 
may not be updated at the user's discretion to maintain 
continuity and reproducibility of results through time. 

Community contribution to the GeneWeaver repository

Users are encouraged to augment the repository by up- 
loading data specific to their research question. Data 
upload is in the form of a simple two-column tab delimited 
text file consisting of one header row. The left column is a 
list of the gene symbols or identifiers, and the second 
column is a score that can later be used to threshold 
genes of interest. These scores may be the results of statistical
analyses such as P-values, q-values, correlation
metrics and effect sizes or they may be binary values
representing list membership. The user is prompted to
provide a threshold for the scores. This flexibility 
enables the aggregation of highly diverse results using 
combinatorial algorithms, while providing sufficient infor- 
mation for future development of more formal 
meta-analysis modules. During the upload process, users 
are also prompted to assign metadata. This step is impera- 
tive as it allows discovery of relevant gene sets for integra- 
tion with existing data. Required metadata includes 
ontology assignment, PubMed IDs, a textual description 
of the data and appropriate labeling. If the data set is put 
forward as 'Public,' it is immediately assigned to Tier IV
and marked 'Provisional' until a curator can verify that 
the data and metadata meet our standards protocol for 
upgrade to Tier III. 
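
The upload format described above (a header row followed by two tab-delimited columns) can be read with a few lines of code. This sketch is not GeneWeaver's upload handler. The function name `parse_gene_set`, the `keep_below` flag and the example file contents are hypothetical, and the direction of thresholding is an assumption that depends on the score type: P-values are kept at or below the threshold, whereas effect sizes would usually be kept at or above it.

```python
import csv
import io


def parse_gene_set(text, threshold, keep_below=True):
    """Parse a two-column, tab-delimited gene/score file with one header row.

    Returns {gene: score} for rows passing the threshold. With keep_below=True
    scores <= threshold are kept (e.g. P-values); otherwise scores >= threshold.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader, None)  # skip the header row
    kept = {}
    for row in reader:
        if len(row) < 2 or not row[0].strip():
            continue  # ignore blank or malformed lines
        gene, score = row[0].strip(), float(row[1])
        passes = score <= threshold if keep_below else score >= threshold
        if passes:
            kept[gene] = score
    return kept


# Hypothetical upload contents, for illustration only.
example_upload = "gene\tp_value\nDrd2\t0.001\nComt\t0.2\nDdx5\t0.04\n"
significant = parse_gene_set(example_upload, threshold=0.05)
```

Storing the score alongside each gene, rather than only list membership, is what allows the threshold to be changed later without a new upload.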

RESOURCE DISSEMINATION 

The GeneWeaver analysis system is made available on the
GitHub community code repository under the GPL 
license. The system includes HTML, AJAX, SVG pages, 
a PostgreSQL database, Python and C modules for 
analysis. Outputs include PDF graphics export, along 
with XML, GAGGLE micro-formats and tab delimited 
text for data interoperability. A standalone GeneWeaver 
package is currently unavailable because the data, tools 
and schema framework are highly system-dependent and 
intended as an Internet-based analysis framework. 
However, we are developing a set of APIs to allow users 
to operate on and contribute to our secondary data envir- 
onment. Tools are documented in a variety of formats 
including a movie demo, tutorial exercise, quick start 
PDF, tool tips and interactive help. A web-based form 
and email links provide individualized user support. 

FUTURE DIRECTIONS 

The underlying framework of GeneWeaver is extensible 
on several fronts. Tool development is centered on 
providing users better approaches to managing the scope 
of output that the site provides. This will be accomplished 
through further development of simple text output that 
can be analyzed using other tools, and in the development 
of additional software for local processing including 
GPU-assisted processing of larger visual output.
GeneWeaver may be augmented by our user community 
through the development of additional modules that 
operate on the gene-phenotype graph which can be 
piloted on machine readable output generated from the 
GeneWeaver site. New species are included through the 
incorporation of additional annotated genomes and 
homology tables, and new gene identifiers, particularly 
for microarray platforms, are incorporated through the 
use of the GEO IDs (34). GeneWeaver has a relational 
back-end designed to scale vastly beyond the current 
data and to accommodate biomolecular entities other 
than genes. We are working to expand the underlying 
data structure to accept discrete relationships between bio- 
logical objects that are outside of the concept of gene. For 
example, ongoing development will allow users to submit
RNA, SNP or methylation data in conjunction with
assigned biological functions without having to generalize 
these data to gene symbols. Our intention is to generalize 
the secondary database to a solution that will incorporate 
most relevant current and future technologies for 
genome-wide function characterization. 

Data sets from publications representing genome-wide 
studies in functional biology are continuously being added 
and curated. This includes the active and passive solicita- 
tion of user-submitted data. The goal is to continue to 
identify quality public research resource data for their po- 
tential incorporation into ODE. Rich documentation 
and active training of users will facilitate the best use of 
the analysis tools by biological researchers. 



CONCLUSIONS 

GeneWeaver provides a real-time, interactive and exten- 
sible environment for user-initiated integrative analysis of 
functional genomics data. It contains a significant reposi- 
tory of curated multispecies gene sets assigned to biologic- 
al function and associated with community metadata. 
Access to GeneWeaver allows users to efficiently ask ques- 
tions about the underlying functional genomics of 
complex biological systems by spanning secondary data 
covering multiple species and experimental protocols. By 
leveraging our large curated repository of discrete gene set 
relationships users can easily integrate their own function- 
al genomics studies and apply robust analysis tools to 
discover new knowledge without the burden of having to 
collect, reanalyze and collate appropriate comparative 
studies. 

When systematic data curation is applied in conjunction 
with GeneWeaver's analysis tools, a consistent framework
can be imposed on the perceived disorder of diverse sec- 
ondary data in any area of functional genomics analysis, 
ensuring scalability, data quality and a common vocabu- 
lary. By rendering these data computable, we retrieve 
valuable biological information from functional genomic 
studies that would otherwise be relegated to the challenges 
of diverse, poorly standardized, yet publicly available 
data. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1-5, Supplementary Figures 1-14. 